ABSTRACT

In this paper we present Carnegie Mellon University’s submission
to the TREC 2009 Relevance Feedback Track. In this submission we
use relevance feedback information through a classification
approach over document pairs. We explore using textual and
non-textual document-pair features to classify unjudged documents
as relevant or non-relevant, and use this prediction to re-rank a
baseline document retrieval. These features include co-citation
measures and URL similarities, as well as features often used in
machine learning systems for document ranking, such as the
difference in scores assigned by the baseline retrieval system.

1.  INTRODUCTION 

Retrieval systems employing relevance feedback techniques
typically focus on augmenting the representation of the
information need in order to improve performance. This is
typically done by adding or re-weighting terms in the query
representation, and such techniques have been shown to be
effective in the past [4, 7, 8, 13]. These techniques, however,
are typically limited to the information need representation used
in the baseline retrieval system and generally do not utilize
information beyond the word distributions in the feedback
documents to modify the query model.

This paper describes the CMU submission to the TREC 2009 Relevance
Feedback Track. With this submission, our goal is to explore
techniques beyond query term re-weighting and other traditional
approaches to query expansion. Our approach constructs pairwise
features between judged-relevant feedback documents and unjudged
documents, and then applies a learned classifier to identify those
unjudged documents likely to be relevant. The output of this
classification is then used to re-rank an initial document
ranking, favoring those documents predicted to be relevant to the
query.

2.  SYSTEM  DESCRIPTION 

The CMU submission system consists of four main components:
baseline retrieval, document selection, relevance classification
and document re-ranking. The document selection and relevance
classification components of the system take a machine learning
approach, using a feature space derived from document pairs.

This section describes these four components in the CMU relevance
feedback track submission, as well as this feature-based
document-pair representation.

2.1  Baseline  Retrieval 


For these experiments, we use Indri for our baseline ranking
(footnote 1). Indri has been shown to perform well in ad-hoc
retrieval tasks at TREC in previous years [8, 10]. We used a small
standard stop-word list and applied the Krovetz stemmer. We
constructed full-dependence model queries from the query text [9].
Smoothing parameters were taken directly from previously published
TREC configurations (footnote 2).

Initial informal experiments with pseudo-relevance feedback (PRF)
with relevance models [7] indicated that traditional approaches to
query expansion may be less effective on the ClueWeb09 collection
due to the susceptibility of those techniques to the web spam
present in the collection. For this reason we did not use PRF in
our baseline run.

2.2  Document  Representation 

We take a machine learning approach to the document selection and
relevance classification components of our system. These
components use a common document representation scheme, described
below.

2.2.1  Pairwise  Representation 

Our feature-based representation constructs a feature vector for
each pair of documents retrieved by the baseline retrieval for a
given query.

$$D_q = \{d_{q1}, d_{q2}, \ldots, d_{qR}\}$$

$$P_q = \{\, \mathbf{f}(d_{qi}, d_{qj}) \mid i, j \in \{1, \ldots, R\},\ i \neq j \,\}$$

$D_q$ are the $R$ documents retrieved for query $q$, and $P_q$ are
the document-pair vectors defined by
$\mathbf{f} : D_q \times D_q \to \mathbb{R}^M$, a vector feature
function over document pairs:

$$\mathbf{f}(d_i, d_j) = \langle f_0(d_i, d_j), f_1(d_i, d_j), \ldots, f_M(d_i, d_j) \rangle$$

where each $f_k$ is an instantiation of an individual feature
derived from the document pair.
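
As a minimal sketch of the construction above, the Python below builds one feature vector per ordered pair of distinct retrieved documents. The document fields (`length`, `score`) and the two feature functions are hypothetical stand-ins chosen for illustration; they are not the feature implementations used in the submission.

```python
from itertools import permutations

def length_diff(di, dj):
    # Hypothetical feature: absolute difference in document length.
    return abs(di["length"] - dj["length"])

def score_diff(di, dj):
    # Hypothetical feature: absolute difference in baseline retrieval score.
    return abs(di["score"] - dj["score"])

FEATURES = [length_diff, score_diff]

def pair_vectors(docs, features=FEATURES):
    """Map each ordered pair (i, j), i != j, to its feature vector."""
    return {
        (i, j): [f(docs[i], docs[j]) for f in features]
        for i, j in permutations(range(len(docs)), 2)
    }

# Toy documents with made-up values.
docs = [
    {"length": 120, "score": -5.0},
    {"length": 300, "score": -6.5},
    {"length": 90, "score": -7.0},
]
vectors = pair_vectors(docs)
```

With R retrieved documents this produces R(R-1) vectors, so in practice the construction would likely be restricted to the pairs that are actually needed.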

This representation allows the use of some features that can be
difficult to integrate into traditional retrieval systems that
exclusively use term weighting for estimating relevance. As we
describe below, many of our features cannot be modeled with a
bag-of-words document representation. Using a pairwise
representation also allows a “query by example” approach to
leveraging the feedback information. We make the assumption that
relevant documents tend to be similar to each other, viz. the
cluster hypothesis [12]. Thus, using pairwise features that
describe document similarities (or dissimilarities), the goal of
our approach is to find other relevant documents similar to those
that have been judged.

Footnote 1: http://www.lemurproject.org/indri

Footnote 2: http://ciir.cs.umass.edu/~metzler/indri-tb05.tgz


Report Documentation Page (Standard Form 298, Rev. 8-98)

Title and subtitle: Pairwise Document Classification for Relevance Feedback
Report date: NOV 2009
Dates covered: 00-00-2009 to 00-00-2009
Performing organization: Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA, 15213
Distribution/availability statement: Approved for public release; distribution unlimited
Supplementary notes: Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009) held in Gaithersburg, Maryland, November 17-20, 2009. The conference was co-sponsored by the National Institute of Standards and Technology (NIST), the Defense Advanced Research Projects Agency (DARPA) and the Advanced Research and Development Activity (ARDA).
Abstract: see report
Security classification (report, abstract, this page): unclassified
Limitation of abstract: Same as Report (SAR)
Number of pages: 6


2.2.2  Features 

The fourteen document-pair feature functions ($f_k(d_i, d_j)$)
used in these experiments are described below. These features are
generally intended to capture different types of similarity (or
dissimilarity) between two documents. Many of these features are
computed with the Jaccard coefficient, a measure of similarity of
two sets of objects. The Jaccard coefficient of two sets $A$ and
$B$ is given by:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$
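
To make Equation (1) concrete, the sketch below computes the Jaccard coefficient over overlapping character 4-grams, the set construction used by the URL features listed next. The helper names and the example hostnames are illustrative, and returning 0.0 when both sets are empty is an assumption; the paper does not say how that case is handled.

```python
def char_ngrams(text, n=4):
    """Set of overlapping character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as having no similarity.
        return 0.0
    return len(a & b) / len(union)

# Illustrative hostnames only.
host_sim = jaccard(char_ngrams("www.cs.cmu.edu"), char_ngrams("www.lti.cs.cmu.edu"))
```

The same `jaccard` function applies unchanged to sets of linking documents, as in the co-citation and references features.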


1. Document features

(a) Length: The absolute value of the difference in the lengths of
$d_i$ and $d_j$.

2. URL features

(a) URL Depth: The absolute value of the difference in the depth
(number of occurrences of ‘/’) in the URLs of $d_i$ and $d_j$.

(b) URL Host: The Jaccard coefficient computed over overlapping
character 4-grams in the URL hostnames of $d_i$ and $d_j$.

(c) URL Path: The Jaccard coefficient computed over overlapping
character 4-grams in the URL paths of $d_i$ and $d_j$.

3. Webgraph features (footnote 3)

(a) In-link: The absolute value of the difference in the number of
in-links to $d_i$ and $d_j$.

(b) Out-link: The absolute value of the difference in the number of
out-links from $d_i$ and $d_j$.

(c) Co-citation: The Jaccard coefficient computed over the sets of
documents that link to $d_i$ and to $d_j$.

(d) References: The Jaccard coefficient computed over the sets of
documents that $d_i$ and $d_j$ link to.

4. Query-derived features

(a) Unigram count: The absolute value of the difference in the
count of query tokens in $d_i$ and $d_j$.

(b) Ordered bigram count: The absolute value of the difference in
the count of ordered query bigrams in $d_i$ and $d_j$.

(c) Unordered bigram count: The absolute value of the difference in
the count of unordered query bigrams in $d_i$ and $d_j$.

(d) Unigram score: The absolute value of the difference in Indri
score of the unigram component of the baseline dependence model
query.

(e) Ordered window score: The absolute value of the difference in
Indri score of the ordered window component of the baseline
dependence model query.

(f) Unordered window score: The absolute value of the difference in
Indri score of the unordered window component of the baseline
dependence model query.

Footnote 3: All webgraph features were computed with the use of the
WebGraph software package, available from
http://webgraph.dsi.unimi.it/ [3].

All features are normalized to have zero mean and unit variance
per query prior to training and testing.
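
The per-query normalization can be sketched as a column-wise z-score over all pair vectors belonging to one query. This is an illustration under two assumptions not stated in the paper: the population standard deviation is used, and a feature that is constant within a query is mapped to 0.0.

```python
from statistics import fmean, pstdev

def normalize_per_query(vectors):
    """Z-score each feature column across the pair vectors of a single query."""
    columns = list(zip(*vectors))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [
        # Assumption: a constant feature (zero deviation) becomes 0.0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in vectors
    ]

# Three toy pair vectors with two features; the second feature is constant.
normalized = normalize_per_query([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
```

Normalizing within each query keeps features such as score differences comparable across queries whose raw score ranges differ.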

2.3  Relevance  Classification 

We can use the above document pair representation scheme to train
a classifier that predicts whether unjudged documents are relevant
or non-relevant given some judged documents. We make the
assumption that relevant documents are likely to be similar to
each other, and dissimilar to non-relevant documents with respect
to the features defined in Section 2.2.2. In contrast, we make no
assumption about the similarity of non-relevant documents to each
other.

We train this classifier on a set of queries with known relevant
and non-relevant documents. Let the set of (binary) judgements for
a given training query $q$ be:

$$J_q = \{(d_{qi}, r_{qi}) \mid r_{qi} \in \{0, 1\}\}$$

where $r_{qi} = 1$ indicates the document $d_{qi}$ is relevant for
query $q$, and $r_{qi} = 0$ indicates the document is non-relevant.

We train a logistic regression classifier on judged document
pairs, letting $y_{qij} \in \{0, 1\}$ indicate the class label of
the pair $(d_{qi}, d_{qj})$. This training set is constructed as
follows:

$$JP_q = \{(\mathbf{f}(d_{qi}, d_{qj}), y_{qij}) \mid r_{qi} = 1,\ y_{qij} = r_{qj}\}$$

so that each training pair contains at least one judged relevant
document ($d_{qi}$). The judgement on the other document
($d_{qj}$) indicates whether this pair is a positive or negative
training example. Thus, the classifier is trained to assign a
positive (1) classification to relevant/relevant document pairs,
and a negative (0) classification to relevant/non-relevant pairs.
This training produces a classification function
$h : D_q \times D_q \to [0, 1]$, where a value close to 1 indicates
a positive classification, and a value close to 0 indicates a
negative classification.
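
The construction of the pairwise training set can be sketched as follows. The function and argument names are hypothetical, and `feature_fn` stands in for the pair feature function; fitting the logistic regression itself is left to whatever learning library is available.

```python
from itertools import permutations

def training_pairs(judgements, feature_fn):
    """Build (feature vector, label) examples from one query's judgements.

    judgements: dict mapping document id -> 1 (relevant) or 0 (non-relevant).
    feature_fn: hypothetical callable returning the pair feature vector.
    """
    examples = []
    for di, dj in permutations(judgements, 2):
        if judgements[di] == 1:  # the first document must be judged relevant
            # The label of the pair is the judgement on the second document.
            examples.append((feature_fn(di, dj), judgements[dj]))
    return examples

# Toy judgements; the "feature vector" here is just the pair of ids.
judgements = {"d1": 1, "d2": 1, "d3": 0}
examples = training_pairs(judgements, lambda di, dj: (di, dj))
```

Pairs whose first document is non-relevant are never generated, which mirrors the statement that no assumption is made about the similarity of non-relevant documents to each other.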

After feedback judgements are collected, assuming some of the
feedback documents are relevant, we can apply the learned
classifier to predict whether unjudged documents are relevant. For
each unjudged document $d_{qj}$, we make a relevance prediction
given all the judged relevant documents:
$\{h(d_{qi}, d_{qj})\ \forall d_{qi} \text{ s.t. } r_{qi} = 1\}$.
This set of predictions can be combined in several ways to form a
final relevance classification, for example taking the mean,
minimum, or maximum value across the predictions. Preliminary
experiments with the TREC 2009 Relevance Feedback Track data
showed that taking the maximum prediction value across all the
judged relevant documents generally yielded the best performance.
Thus, we define our final prediction for an unjudged document as
follows:

$$n(d_{qj}) = \max_{d_{qi} \in J_q,\ r_{qi} = 1} h(d_{qi}, d_{qj})$$

This relevance prediction effectively classifies unjudged
documents based on their similarity to the closest judged relevant
feedback document with respect to the feature space defined above.
Because of this, it is critical to collect relevance judgements on
a diverse set of documents in order to maximize the chance of
identifying relevant documents similar to possibly relevant but
unjudged documents.
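
The maximum-based prediction can be sketched in a few lines. Here `h` is any callable that scores a (judged relevant, unjudged) pair in [0, 1]; the lookup table below is a toy stand-in with made-up values, not output from the trained classifier.

```python
def predict_relevance(h, relevant_docs, unjudged_doc):
    """Maximum pair prediction over all judged relevant feedback documents."""
    if not relevant_docs:
        raise ValueError("at least one judged relevant document is required")
    return max(h(d_rel, unjudged_doc) for d_rel in relevant_docs)

# Toy pair classifier: a lookup table of made-up probabilities.
toy_h = {
    ("r1", "u1"): 0.2, ("r2", "u1"): 0.9,
    ("r1", "u2"): 0.4, ("r2", "u2"): 0.1,
}
scores = {
    u: predict_relevance(lambda a, b: toy_h[(a, b)], ["r1", "r2"], u)
    for u in ("u1", "u2")
}
```

Swapping `max` for a mean or `min` gives the alternative combinations mentioned in the text.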


Note that judged non-relevant documents are used for training the
model, but are not used at prediction time after collecting
feedback judgements. Methods of using these non-relevant feedback
documents are an area for future refinement of the models
presented here.


2.4  Document  Re-Ranking 

We use the output of the above relevance classifier $n$ to re-rank
the documents retrieved with the baseline ranking algorithm.
Because Indri’s language modeling score and the output of a
logistic regression classifier are difficult to re-scale to a
common range, we chose to combine them using a rank-based voting
method, Borda Count [1]. Rather than combining the scores of the
baseline ranker and the logistic regression, Borda Count linearly
combines the ranks of the documents from each of these components.
Although this method ignores the magnitude of the confidence of
the prediction output, it avoids the need to re-scale the scores
to be comparable.

We use a weighted version of Borda Count in these experiments to
adjust the relative influence of the baseline ranking score and
the relevance prediction output. This weight is selected to
maximize Mean Average Precision via a grid search on the same
training data used to train the relevance classifier. For these
experiments, we selected a weight of 0.3 on the relevance
classifier and 0.7 on the baseline ranking.
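
A weighted Borda Count merge of the two rankings might look like the sketch below. The point scheme (the top document receives N-1 points, the last receives 0) and the tie-breaking rule are assumptions; the paper specifies only that ranks are combined linearly with weights 0.3 and 0.7.

```python
def borda_points(ranking):
    """Give the top document len(ranking) - 1 points and the last document 0."""
    n = len(ranking)
    return {doc: n - 1 - pos for pos, doc in enumerate(ranking)}

def weighted_borda(baseline_ranking, classifier_ranking, classifier_weight=0.3):
    """Merge two rankings of the same documents by weighted Borda points."""
    base = borda_points(baseline_ranking)
    clf = borda_points(classifier_ranking)
    combined = {
        doc: (1 - classifier_weight) * base[doc] + classifier_weight * clf[doc]
        for doc in baseline_ranking
    }
    # Assumption: ties keep the baseline order (sorted() is stable).
    return sorted(baseline_ranking, key=lambda doc: -combined[doc])

merged = weighted_borda(["a", "b", "c", "d"], ["d", "c", "b", "a"])
flipped = weighted_borda(["a", "b", "c", "d"], ["d", "c", "b", "a"],
                         classifier_weight=0.9)
```

With the default weight the baseline order dominates even when the classifier ranking is fully reversed, while a weight of 0.9 lets the classifier order win.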


2.5  Document  Selection 

The final component of our system is the document selection
system. As pointed out earlier, diversity is a critical factor
underlying our document selection approach. The classification
method in Section 2.3 gives a probabilistic measure of the
relevance of an unjudged document paired with a judged relevant
document. The final relevance score of an unjudged document is
then the maximum value assigned across all the judged relevant
documents for that query. Having similar judged relevant documents
agree on the relevance of an unjudged document is not as effective
as having agreement across a diverse committee. Thus, diversity is
the main focus of our selection mechanism.

The most naive approach is to select the top 5 documents for
feedback. However, it is often the case that top documents are
similar to each other. Learning the relevance level of similar
documents might improve the ranking for additional similar
documents, but it might not generalize to a larger set of
documents. The diversity factor has been investigated in the
active learning literature [5, 11], which indicates that choosing
unlabeled examples that are representative of the underlying data
distribution boosts performance. Hence, in this section we focus
on selecting documents that are likely to be relevant and also
different from each other. Specifically, we adopted a clustering
framework where we cluster the unjudged documents using the Fuzzy
Clustering algorithm [2, 6].

The objective of fuzzy clustering is to spread out each example
into various clusters. In other words, each example has a degree
of belonging to each cluster, rather than belonging completely to
a single cluster. Hence, it is a soft clustering method rather
than a hard clustering method. For each point $x$, there is a
corresponding coefficient indicating the degree of belonging to
the $k$th cluster, i.e. $u_k(x)$. The coefficients for any given
point $x$ sum to 1:

$$\sum_{k=1}^{K} u_k(x) = 1 \quad \forall x \qquad (2)$$

Furthermore, the degree of belonging $u_k(x)$ (or the membership
coefficient) is inversely related to the distance of the point to
the cluster center $\mathit{center}_k$:

$$u_k(x) = \frac{1}{d(\mathit{center}_k, x)} \qquad (3)$$

Hence, points further away from the center of the cluster have a
lower degree of belonging than the points closer to the center.
The cluster center is calculated using the mean of all points,
weighted by their membership coefficients:

$$\mathit{center}_k = \frac{\sum_x u_k(x)^f \, x}{\sum_x u_k(x)^f} \qquad (4)$$

where $f > 1$ is a predefined parameter that controls the
fuzziness. For instance, increasing $f$ leads to fuzzier
clusterings, whereas $f$ close to 1 resembles the k-means
algorithm. Finally, the fuzzy clustering tries to minimize the
following objective function

[Equation (5): objective function; not legible in the source]

where $d(i, j)$ is the distance between two documents $d_i$ and
$d_j$. The algorithm tries to minimize the inter-cluster
similarity while minimizing the intra-cluster variance. It
converges to a locally optimal solution [2].
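
The membership and center updates can be illustrated with a small fuzzy c-means loop. This is a sketch under several assumptions rather than the clustering used in the submission: points are numeric vectors compared with Euclidean distance (the paper instead derives distances from the pair classifier), memberships use the standard fuzzy c-means normalization of inverse distances so that they sum to 1, initial centers are supplied by the caller, and a fixed number of iterations replaces a convergence test.

```python
import math

def memberships(points, centers, f=2.0, eps=1e-12):
    """Return u[i][k], the degree to which point i belongs to cluster k."""
    power = 2.0 / (f - 1.0)
    u = []
    for x in points:
        # Clamp distances so a point sitting on a center does not divide by zero.
        dists = [max(math.dist(x, c), eps) for c in centers]
        weights = [d ** -power for d in dists]  # closer centers weigh more
        total = sum(weights)
        u.append([w / total for w in weights])  # each row sums to 1
    return u

def update_centers(points, u, f=2.0):
    """Mean of all points, weighted by membership raised to the power f."""
    dims = len(points[0])
    centers = []
    for k in range(len(u[0])):
        w = [row[k] ** f for row in u]
        total = sum(w)
        centers.append(tuple(
            sum(wi * x[d] for wi, x in zip(w, points)) / total
            for d in range(dims)
        ))
    return centers

def fuzzy_c_means(points, centers, f=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    for _ in range(iterations):
        u = memberships(points, centers, f)
        centers = update_centers(points, u, f)
    return memberships(points, centers, f), centers

# Toy 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
u, centers = fuzzy_c_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

In this parameterization the exponent 2/(f-1) grows as f approaches 1, which pushes memberships toward 0 or 1 and so toward a hard, k-means-like assignment.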

We use the output of our trained logistic regression classifier on
the document-pair features, as described above, to approximate
this distance metric, $d(i, j)$. Although this is not a proper
metric in the mathematical sense, it can be used by the presented
clustering algorithm and it does capture the feature-weighted
similarity used in the relevance classification component of our
system.

Because our re-ranking system does not use non-relevant feedback
documents, we want to select documents that are likely to be
relevant as well as diverse. The classification scheme described
in Section 2.3 requires judged relevant documents to make
predictions on the unjudged documents during testing. Initial
investigation with the TREC 2008 Relevance Feedback data indicated
that increasing the number of judged relevant documents is quite
beneficial to the final re-ranking performance. Therefore, our aim
is to identify the potentially relevant documents while
maintaining a degree of diversity among them. Assuming the
baseline Indri ranking is well-tuned and relatively accurate, it
is reasonable to consider the top-ranked documents for judgement.
After we build the clusters among unjudged documents, we choose
the top ranked document in each cluster to be judged. This simple
method has the two characteristics we require: 1) it consists of
top ranked documents that are likely to be relevant, and 2) it is
a diverse set that leverages the underlying relevance
distribution.
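
The selection step can be sketched as picking, for each cluster, the first document in baseline order that falls in it. Assigning each document to the cluster with its highest membership coefficient is an assumption made here for illustration; the paper does not state how fuzzy memberships are turned into cluster assignments.

```python
def select_for_feedback(baseline_ranking, memberships):
    """Pick the highest-ranked document from each cluster.

    baseline_ranking: document ids, best first.
    memberships: dict of document id -> list of cluster membership coefficients.
    """
    selected = {}
    for doc in baseline_ranking:
        u = memberships[doc]
        # Assumption: a document belongs to its highest-membership cluster.
        cluster = max(range(len(u)), key=u.__getitem__)
        selected.setdefault(cluster, doc)  # keep only the first (top-ranked) hit
    return [selected[k] for k in sorted(selected)]

# Toy memberships over two clusters.
memberships = {"a": [0.8, 0.2], "b": [0.7, 0.3], "c": [0.1, 0.9], "d": [0.4, 0.6]}
chosen = select_for_feedback(["a", "b", "c", "d"], memberships)
```

Using as many clusters as there are feedback slots yields one candidate per slot, provided every cluster ends up with at least one document.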


3.  EXPERIMENTS 

This  section  describes  the  experiments  conducted  for  the 
TREC  2009  relevance  feedback  track. 


3.1  Training 

The document selection and relevance classification components
require training data in order to learn weights on the features
described in Section 2.2.2 for use in the logistic regression
relevance classifier (Section 2.3) and the clustering algorithm
(Section 2.5). Because previous queries and relevance judgements
do not exist on the ClueWeb09 dataset, we built our training data
from previous years’ TREC ad-hoc tasks using the GOV2 collection.
This training set includes all relevance judgements for queries
701-850, excluding those queries with no relevant documents. The
final constructed training set includes 1.8 million document
pairs, with 31% positive examples (relevant/relevant pairs) and
69% negative examples (relevant/non-relevant pairs). Although
these two document collections are somewhat different, the feature
set described above can be generated on both collections. We make
the assumption for these experiments that the feature weights
learned on the GOV2 collection are similarly effective on the
ClueWeb09 collection.

3.1.1 Feature Weights

Sections 2.2 and 2.3 describe the pairwise document representation
and how we use this representation in a logistic regression
classifier to predict the relevance level of an unjudged document
given a judged relevant document. It is informative to inspect the
learned logistic regression weights for each of the features used
in our model, as larger-magnitude weights indicate more
influential features. Figure 1 shows the absolute weights of all
the features learned in the logistic regression model. We can see
that the most influential features in our model are the URL-based
features, particularly the similarity of the host name and path
portions of the URL. The next most powerful features are the
components of the baseline dependence model query: the ordered and
unordered window scores assigned by Indri. The Out-link count
feature is the only webgraph feature that is at all influential in
the model. This feature is derived exclusively from the content of
the page (just the count of anchors), rather than relations
between documents in the collection. This may be an indication
that the GOV2 webgraph used for training is too sparse to
effectively estimate the other webgraph features, which rely on
linking among documents in the collection.

Figure 1: Learned Feature Weights.

3.1.2  Document  selection 
In this section, we analyze the quality of our document selection
mechanism across queries. First, looking at the distribution of
ranks in our baseline retrieval selected for judgement, we can see
a strong skew towards the top-ranked documents. We also see that
we do a reasonably good job of finding relevant documents not only
at high ranks but also at lower ranks, though with decreasing
frequency. This is especially useful since it detects the relevant
documents that the baseline ranker misjudged by placing them at
lower ranks. Incorporating such documents into the rank learner is
likely to lead to improvements.

Figure 2: Rank distribution of selected documents, and judged
relevant documents.

To evaluate the quality of our phase 1 document selection (CMU.1),
we primarily consider the fraction of other inputs that our phase
1 input performed better than, which we refer to as the score
here. (This score was computed and distributed by the track
organizers.) The score value is intended to measure the general
quality of the selected documents across a variety of systems that
use this feedback as input. A higher value indicates the documents
selected by our phase 1 system tended to be more useful than
documents selected by other phase 1 systems. The score is
calculated on a per-query basis, and we evaluate the correlation
across queries with various other measures. These measures are
described below:

1. Mean Rank: The mean rank in our baseline ranking of the
documents selected in our phase 1 selection (CMU.1).

2. Max Rank: The max rank in our baseline ranking of the documents
selected in our phase 1 selection (CMU.1).

3. Num. Relevant: The number of documents selected by CMU.1 judged
relevant for the query.

Table  1  shows  the  mean  and  the  standard  deviation  of 
these  measures  and  their  correlations  with  the  score,  all 
computed  across  queries.  There  is  not  a  strong  correlation 
between  the  score  value  and  any  of  the  other  performance 
measures  computed  over  our  document  selection  set. 


Measure         Mean     Std.     Correlation with score
score           0.525    0.152    -
Mean Rank       10.24    5.85     0.139
Max Rank        30.0     19.59    0.115
Num Relevant    2.42     1.26     -0.030

Table 1: Document selection statistics and correlations with the
score

3.1.3  Phase  2  Performance 

Our document selection component was designed to identify
documents useful for our relevance classifier and re-ranking
components. For this reason, another appropriate method of
evaluating the quality of our phase 1 input is to compare the
relative improvement in phase 2 performance using our phase 1
input and other phase 1 inputs. Figure 3 shows this relative
improvement as a function of the total number of relevant
documents selected by that phase 1 input. For each input set, we
compute the statMAP on the baseline and phase 2 run excluding
those documents in the input set from each evaluation (i.e.
residual performance). The relative improvement of a phase 2 run
over the baseline is referred to as the relative residual
performance improvement and is used as our primary measure to
evaluate phase 2 performance.
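
The residual evaluation can be sketched as follows. Plain average precision is used here as a simplified stand-in for statMAP, which is a sampling-based estimate and is not reproduced; the function names and the toy rankings are illustrative only.

```python
def average_precision(ranking, relevant):
    """Uninterpolated average precision of a ranking for one query."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def residual(ranking, relevant, feedback):
    """Drop the feedback documents from both the ranking and the judgements."""
    return [d for d in ranking if d not in feedback], relevant - feedback

def relative_residual_improvement(baseline, reranked, relevant, feedback):
    """(residual AP of re-ranked run - residual AP of baseline) / baseline."""
    base_rank, res_rel = residual(baseline, relevant, feedback)
    new_rank, _ = residual(reranked, relevant, feedback)
    base_ap = average_precision(base_rank, res_rel)
    if base_ap == 0:
        raise ValueError("baseline residual AP is zero; improvement undefined")
    return (average_precision(new_rank, res_rel) - base_ap) / base_ap

# Toy query: "a" was shown for feedback, so it is excluded from both runs.
improvement = relative_residual_improvement(
    baseline=["a", "b", "c", "d", "e"],
    reranked=["a", "d", "b", "e", "c"],
    relevant={"a", "d", "e"},
    feedback={"a"},
)
```

Removing the feedback documents before scoring prevents a run from being credited merely for promoting documents whose relevance was already revealed.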

There is a strong correlation between the number of relevant
documents selected and the relative improvement in statMAP
(Pearson’s correlation of 0.926). This is likely due to our phase
2 system ignoring non-relevant feedback documents, and suggests
that focusing only on relevant feedback is not always an
appropriate strategy.

We also see that, although our phase 1 selection system is
moderately coupled with the phase 2 re-ranking system, it does not
yield the best relative improvement in statMAP. These results
clearly indicate that for our phase 2 system, increasing the
number of relevant documents selected for feedback is an effective
strategy for improving performance.

Looking deeper at the robustness of our phase 2 performance as a
function of feedback documents, we evaluate the relative residual
performance for all input sets as we vary the weight given to the
feedback documents. Figure 4 shows the relative residual
performance for each of our system’s input sets as the weight on
feedback documents varies from 0 to 1. The vertical line in this
figure indicates the weight we used in our TREC submission (0.3)
and the values along this vertical line correspond to those
plotted in Figure 3. We can see that the weight selected based on
our training data is not optimal for all of the input sets, but
does represent a reasonable tradeoff across the different inputs.
The best performing input set (CMIC.1) could have achieved almost
a 13% improvement in residual statMAP had we selected a lower
weight, but for most input sets the selected value is within 2%
relative residual performance of the optimal weight.

Figure 3: Relative residual performance improvement in statMAP
over our baseline vs. number of relevant documents found in the
input set. Each point represents a unique input set, and our phase
1 input (CMU.1) is shown in black.

Interestingly, the CMIC.1 input set, which yielded our best
relative increase in statMAP, almost exclusively consists of
documents from Wikipedia (footnote 4), whereas all of the other
input sets consist of less than 5% Wikipedia documents. Although
documents from Wikipedia may tend to be of higher general quality
with less spam, these documents may be less diverse, especially
with regard to our link-based and URL-based document pair
features. This result is somewhat contrary to the hypothesis that
drove our document selection algorithm, that a diverse set of
documents with respect to our feature space would be most
beneficial in final re-ranking performance.

Figure 4: Relative residual statMAP for each input set as feedback
document weight increases.

Footnote 4: http://en.wikipedia.org

4. CONCLUSION

In this year’s submission to the TREC Relevance Feedback track, we
took a machine learning approach to both the phase 1 (document
selection) and phase 2 (document re-ranking) components of our
system. These two systems use a shared feature space to represent
pairs of documents. Our system specifically tried to leverage
non-textual information such as webgraph features and URL
similarity features, as well as textual features such as scores
generated from different components of the baseline query. The
shared representation moderately couples our selection and
re-ranking systems, enabling us to select a set of documents
specifically deemed to be useful for the down-stream re-ranking
component.

Initial analysis suggests that phase 1 selection algorithms that
identify more relevant documents yield a higher relative increase
in performance for our phase 2 re-ranking system. Although our
phase 1 selection system performed well, yielding almost an 8.5%
relative improvement in statMAP, higher relative improvement was
achieved by several other phase 1 inputs which did not share the
same feature space. For this reason, it is not clear that coupling
the representation used in our phase 1 and phase 2 systems yielded
a significant performance boost. Further analysis is necessary to
understand the effect of coupling these two systems.

One of the goals of the phase 1 selection system was to identify a
diverse set of relevant documents by clustering the top-ranked
documents from the baseline retrieval. This clustering was
performed in the same feature space used by the relevance
classification component (Section 2.3) in an effort to couple the
two systems. To evaluate the effect of this coupling, future work
should assess the performance of other selection mechanisms that
aim to identify diverse documents, but not necessarily within the
same feature space.